For another scientist:
This is a bioinformatic pipeline that starts with multiple fastq files provided by sequencing company JonahVentures. The project is Lionfish gut content metabarcoding of COI on mtDNA. Each fastq files are the sequences found within the gut of one individual fish. Reference file contains sequences from visually identified fishes collected in the same transect as the lionfish.
For any person off the street:
I am using DNA to identify partially digested material found in the stomachs of lionfish to understand what they are eating. This code will identify the DNA sequences from the stomachs based off of sequences from visually identified species collected in the same area as the lionfish.
Load the necessary packages
library (tidyverse)
install.packages('insect')
library (insect)
#if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
library (digest)
library (seqinr)
library(lubridate)
library(ShortRead)
library(dada2); packageVersion("dada2")
library(dplyr)
Path to fastq files before trimming primers
path <- "../Data/"
list.files(path)
Sorting forward and reverse reads based on file name pattern
fnFs <- sort(list.files(path, pattern=".R1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern=".R2.fastq", full.names = TRUE))
sample.names <- str_replace(basename(fnFs), ".1.R1.fastq","")
This shows a plot of the quality of the forward reads
In gray-scale is a heat map of the frequency of each quality score at each base position. The mean quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).
plotQualityProfile(fnFs[1:5])
This plot shows quality of the reverse reads
plotQualityProfile(fnRs[1:5])
This is the path to cut adapt. I don’t have permission to run on raven but I can run this on my laptop with appropriate directory
cutadapt <- "../Applications/cutadapt.exe"
These are the COI primers. Taken from methods txt file from Jonah ventures. The Function allOrients makes forward, compliment, reverse, and reverse compliment for both primers. This is to look for the primers in the sequences in any orientation (PrimerHits, the next function/ chunck)
FWD <- "GGWACWGGWTGAACWGTWTAYCCYCC"
REV <- "TANACYTCNGGRTGNCCRAARAAYCA" #N's replaced I's from Jonah
allOrients <- function(primer) {
require(Biostrings)
dna <- DNAString(primer)
orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), RevComp = reverseComplement(dna))
return(sapply(orients, toString))
}
FWD.orients <- allOrients(FWD)
FWD.orients
REV.orients <- allOrients(REV)
REV.orients
PrimerHits function is looking for the primers in all orientations in the sequences.
primerHits <- function(primer, fn){
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[[1]]))
This checks that we can use cutadapt
system2(cutadapt, args = "--version")
Running cutadapt. Permission denied on raven but works on laptop
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
R1.flags <- paste0("-g", " ^", FWD)
R2.flags <- paste0("-G", " ^", REV)
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags,
"-m 1", #discards reads having length zero bp
"--discard-untrimmed",
"-o", fnFs.cut[i], "-p", fnRs.cut[i], #output files
fnFs[i], fnRs[i])) #input files
}
Path to fastq files after trimming primers
path.cut <- "../Data/cutadapt"
list.files(path.cut)
Sorting forward and reverse reads based on file name pattern. save as objects
fnFs_cut <- sort(list.files(path.cut, pattern=".R1.fastq", full.names = TRUE))
fnRs_cut <- sort(list.files(path.cut, pattern=".R2.fastq", full.names = TRUE))
sample.names <- str_replace(basename(fnFs), ".1.R1.fastq","")
This shows a plot of the quality of the forward reads after removing primers
plotQualityProfile(fnFs_cut[1:5])
This plot shows quality of the reverse reads after removing primers
plotQualityProfile(fnRs_cut[1:5])
filtF1s <- file.path(path, "filtered", paste0(sample.names, ".R1.filt.fastq.gz"))
filtR1s <- file.path(path, "filtered", paste0(sample.names, ".R2.filt.fastq.gz"))
names(filtF1s) <- sample.names
names(filtR1s) <- sample.names
out <- filterAndTrim(fnFs_cut, filtF1s, fnRs_cut, filtR1s, truncLen=c(200,200),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
head(out)
Checking the quality after filtering and trimming
plotQualityProfile(filtF1s)
plotQualityProfile(filtR1s)
errF1 <- learnErrors(filtF1s, multithread=TRUE,verbose = 0)
errR1 <- learnErrors(filtR1s, multithread=TRUE,verbose = 0)
visualise the estimated error rates
plotErrors(errF1, nominalQ=TRUE)
derepF1s <- derepFastq(filtF1s, verbose = TRUE)
derepR1s <- derepFastq(filtR1s, verbose = TRUE)
dadaF1s <- dada(derepF1s, err = errF1, multithread = TRUE)
dadaR1s <- dada(derepR1s, err = errR1, multithread = TRUE)
dadaF1s[[1]]
mergers <- mergePairs(dadaF1s, filtF1s, dadaR1s, filtR1s, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
#Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE)
dim(seqtab.nochim)
sum(seqtab.nochim)/sum(seqtab)
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaF1s, getN), sapply(dadaR1s, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
In theory, one command takes the output from above and assigns taxonomy based on a reference file
taxonomy <- assignTaxonomy(
seqtab.project.miseqrun1,
"../Data/ALL_Atlantic_2_Feb_2017.fasta",
taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "species"),
tryRC = FALSE,
minBoot = 50,
outputBootstraps = TRUE,
multithread = TRUE,
verbose = TRUE
)
However, our input file is not formatted correctly, and it give this error:
Error in assignTaxonomy(seqtab.project.miseqrun1, “../Data/ALL_Atlantic_2_Feb_2017.fasta”, : Incorrect reference file format for assignTaxonomy.
Here is the original file
head -n 15 ../Data/ALL_Atlantic_2_Feb_2017.fasta | tail -n 5
## >0089551016__Eucinostomus_
## NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN
## >BZLWA189_06|JQ840067|BZLW4189|Gerreidae
## NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN
## >BZLWA190_06|JQ840070|BZLW4190|Gerreidae
Translate command to make | into ;
tr '|' ';' < ../Data/ALL_Atlantic_2_Feb_2017.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit.fasta
Translate commands to remove returns
tr '
' ';' < ../Data/ALL_Atlantic_2_Feb_2017_edit.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit2.fasta
Sed command to put returns where I want them
sed 's/>/\'$'\n/g' < ../Data/ALL_Atlantic_2_Feb_2017_edit2.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit3.fasta
sed command to put > back at beginning of each line
sed 's/^/>/' < ../Data/ALL_Atlantic_2_Feb_2017_edit3.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta
Here is how the referece file looks now
head -n 15 ../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta | tail -n 3
## >SMSA146_09;JQ842451;SMSA7146;Eucinostomus;NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN;
## >0089551017__Atherinidae_;NNNNNNNNNNNNNNNNNNNNNNNCCTCTATCTAGTATTT-GGTGCTTG-AGCC-GGAATGGT-AGGCACTGCCTTAAGCCTT-CTAATTCGGGCAGAATTAAGCCAACCAGGCTCTCTCCTTGGAGATGACCAAATTTACAATGTAATCGTTACAGCACACGCCTTTGTAATAATTTTCTTTATAGTAATACCAATTATGATTGGAGGATTTGGAAACTGACTTATTCCCCTTATGATTGGAGCCCCTGATATGGCATTTCCCCGAATGAACAATATGAGCTTCTGACTCCTCCCTCCCTCTTTTCTTCTCCTTCTTGCATCATCTGGAGTTGAAGCCGGAGCTGGTACAGGCTGAACAGTCTACCCCCCTCTAGCAGGAAACCTAGCTCACGCAGGTGCATCTGTAGATCTTACCATCTTTTCTCTCCACCTAGCAGGTGTTTCATCAATTCTAGGGGCTATCAACTTCATTACAACTATTATTAATATGAAACCCCCTGCAATCTCACAATACCAGACACCCCTGTTCGTTTGAGCTGTCTTAATTACTGCCGTTCTTCTCCTTCTTTCCCTTCCA-GTTCTGGCAGCTGGCATCACTATACTACTAACAGACCG-AAATCTAAATACCACCTTCTTCGACCC-T-GCCG-GAGGGGGTGA--TCCCATTCTGTATCAACATCTCTTCNNNNNNNNNN;
## >0089551020__Ogilbia_;NNNNNNNNNNNNNNNNNNNNNNNCCTCTACCTGGTATTC-GGTGCTTG-GGCA-GGAATAGT-AGGCACAGCACTAAGCCTC-CTTATTCGGGCAGAACTGAGCCAGCCCGGCTCTCTCCTTGGAGATGACCAAATTTATAACGTCATCGTCACAGCCCACGCCTTCGTAATAATTTTCTTTATAGTAATGCCAATCATGATCGGTGGATTTGGAAACTGACTTATTCCCCTGATGATTGGAGCCCCCGATATAGCGTTTCCCCGAATAAATAATATGAGCTTCTGGCTACTACCCCCATCCTTCCTTCTCCTTCTTGCCTCTTCCGGAGTAGAAGCAGGGGCAGGGACCGGCTGAACAGTCTACCCGCCCTTAGCAGGCAACCTCGCCCATGCAGGAGCCTCCGTTGACTTAACCATCTTCTCCCTTCACCTAGCCGGAGTTTCCTCCATCCTAGGAGCAATTAATTTCATCACAACAATTATTAACATGAAACCCCCAGCCATTTCCCAATACCAAACCCCTTTATTCGTGTGAGCCGTGTTAATTACAGCCGTTCTCCTTCTCCTATCCCTCCCG-GTCCTTGCCGCTGGTATCACCATACTATTAACAGACCG-AAATTTAAACACAACATTCTTCGACCC-C-GCTG-GTGGAGGTGA--CCCAATCTTATACCAACACCTATTCNNNNNNNNNN;
assignTaxonomy function should then assign taxonomy after editing the reference file to the appropriate format
taxonomy <- assignTaxonomy(
seqtab.project.miseqrun1,
"../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta",
taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "species"),
tryRC = FALSE,
minBoot = 50,
outputBootstraps = TRUE,
multithread = TRUE,
verbose = TRUE
)
BUT IT STILL DIDN’T WORK even after editing the reference file. It returns this error:
Error in tax[[1]] : subscript out of bounds
Which appears to mean some sequences in the reference file are shorter than some desired length. So let’s just Blast like normal. A for effort.
Shout out to Halia cause I took this from her.
Check if the softwear is working and you can access the help menu
../Applications/ncbi-blast-2.13.0+/bin/blastx -h
make the blast database
../Applications/ncbi-blast-2.13.0+/bin/makeblastdb \
-in ../Data/ALL_Atlantic_2_Feb_2017.fasta \
-dbtype nucl \
-out ../Output/Reference_DB
Write output from dada pipline (seqtab.nochim) to a fasta file that can be blasted transpose table
seqtab.nochim_trans <- as.data.frame(t(seqtab.nochim)) %>% rownames_to_column(var = "sequence") %>%
rowid_to_column(var = "OTUNumber") %>% mutate(OTUNumber = sprintf("otu%04d",
OTUNumber)) %>% mutate(sequence = str_replace_all(sequence, "(-|\\.)", ""))
convert seqtab.nochim_trans to fasta file and double check where it goes in the directory
df <- seqtab.nochim_trans
seq_out <- Biostrings::DNAStringSet(df$sequence)
names(seq_out) <- str_c(df$OTUNumber, df$Supergroup, df$Division, df$Class,
df$Order, df$Family, df$Genus, df$Species, sep = "|")
Biostrings::writeXStringSet(seq_out, str_c( "Test_1_ASV.fasta"), compress = FALSE,
width = 20000)
examine fasta file
head ../Data/Test_1_ASV.fasta
## >otu0001
## TCTATCAGGCAACCTAGCCCACGCAGGTGCTTCCGTAGACTTAACCATCTTCTCCCTCCACCTGGCAGGTATTTCTTCAATCCTGGGAGCCATCAACTTTATCACTACCATTATTAATATGAAACCCCCTGCCATTTCCCAGTATCAAACCCCTCTTTTCGTATGGGCTGTTCTTATTACTGCCGTACTTCTCCTTCTGTCCCTCCCAGTCTTAGCTGCTGGCATCACAATACTTCTAACTGATCGAAACCTAAACACTACATTCTTTGACCCTGCGGGAGGAGGTGACCCAATCCTTTATCAACACCTGTTC
## >otu0002
## CTTAGCGGGCAATCTTGCCCATGCCGGGGCATCTGTCGACCTAACAATTTTCTCCTTGCACTTAGCAGGCATTTCATCAATCCTGGGGGCAATCAATTTTATTACAACAATTATTAATATAAAACCCCCAGCTATCTCCCAGTACCAAACTCCACTGTTTGTATGAGCTGTCTTAATTACGGCAGTTCTTTTACTTCTTTCGCTCCCAGTCCTTGCCGCCGGTATTACAATACTGCTTACTGATCGAAACCTCAACACCACCTTCTTTGACCCAGCGGGGGGAGGAGACCCAATTCTTTACCAACACCTCTTC
## >otu0003
## TCTATCAGGCAACCTAGCCCACGCAGGTGCTTCCGTAGACTTAACCATCTTCTCCCTCCACCTGGCAGGTATTTCTTCAATCCTGGGAGCCATCAACTTTATCACTACCATTATTAATATGAAACCCCCTGCCATTTCCCAGTATCAAACCCCTCTTTTCGTATGGGCTGTTCTTATTACTGCCGTACTTCTCCTTCTATCCCTCCCAGTCTTAGCTGCTGGCATCACAATACTTCTAACTGATCGAAACCTAAACACCACATTCTTTGACCCTGCGGGAGGAGGTGACCCGATCCTTTATCAACACCTGTTC
## >otu0004
## CTTAGCGGGCAATCTTGCCCATGCCGGGGCATCTGTCGACCTAACAATTTTCTCCTTGCACTTAGCAGGCATTTCATCAATCCTGGGGGCAATCAATTTTATTACAACAATTATTAATATAAAACCCCCAGCTATCTCCCAGTACCAAACTCCACTGTTTGTGTGAGCTGTCTTAATTACGGCAGTTCTTTTACTTCTTTCGCTCCCAGTCCTTGCCGCCGGTATTACAATACTGCTTACTGATCGAAACCTCAACACCACCTTCTTTGACCCAGCGGGGGGAGGAGACCCAATTCTTTACCAACACCTCTTC
## >otu0005
## TGGCTGGAAACCTAGCCCACGCAGGCGCGTCTGTTGACTTAACAATTTTCTCCCTTCACCTTGCAGGTGTTTCTTCGATTCTTGGTGCAATTAATTTTATTACCACAATTATTAATATGAAGCCCCCAGCCATCTCTCAGTATCAAACGCCTTTATTTGTTTGAGCAACCCTAATTACAGCTGTCCTGCTTCTTCTTTCCCTTCCTGTTTTAGCTGCTGGCATCACAATGCTTCTCACAGACCGAAACCTTAACACCACTTTCTTCGACCCGGCAGGAGGAGGAGACCCAATCCTATACCAACACCTCTTC
../Applications/ncbi-blast-2.13.0+/bin/blastn \
-query ../Data/Test_1_ASV.fasta \
-db ../Output/Reference_DB \
-out ../Output/Test_1_ASV.tab \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt 6
Examine blast output
head -2 ../Output/Test_1_ASV.tab
wc -l ../Output/Test_1_ASV.tab
## otu0001 DRRA1971_12|DRR110186|Chromis_multilineata 97.500 320 1 6 1 313 371 690 4.75e-154 540
## otu0002 LFG6_01__201780231_ 97.812 320 0 6 1 313 371 690 1.02e-155 545
## 14 ../Output/Test_1_ASV.tab
This is really ugly, but I think it technically worked.
This is the desired output from the bioinformatic pipeline. Some columns were omitted for brevity (for example, the columns with the sequence, kingdom, and phylum). ESVid is the ID of the Estimated Sequence Variance. Class Order, Family, Genus, and Species refer to the taxonomic identification of the ESV. X..Match refers to the percent match. Columns with names following S0434##.# format refer to the sample (individual lionfish gut) Numbers in the matrix refer to how many sequences of the given taxa were found in the sample, with zero being not found in that sample. Number of sequences does not correlate with the abundace or amount of prey from a species in the stomach. We can only infer presence/ absence and frequency of occurance of each taxon.
Taxon Pterois volitans (Red Lionfish) reads removed under the assumption that these reads are from the stomach lining of the individual and not representative of conspecific consumption (canabalism).
Data interpretation: We cannot assume more reads = more of that taxon in the sample. We can look at general presence/ absence or frequency of occurance.
table<-read.csv("../Output/JVB1606-UniCOI-read-data.csv") #Read in Table
table_no_lions <- table[-c(which(table$Genus=="Pterois")),] #Remove Lionfish reads
datatable(table_no_lions[,-c(1,3,4,5,18)]) #View table without a few columns